Statistics for Data Science II
Let us now review the “checks” we will perform on our models.
Model assumptions
Outliers
Influence
Leverage
Multicollinearity
Recall the general linear model, y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon, where \varepsilon is the residual error term.
Recall that the residual error is defined as \varepsilon = y - \hat{y} where
y is the observed value
\hat{y} is the predicted value
The residual tells us how far away the observation is from the response surface (the predicted value).
Ordinary least squares estimation means that we have minimized the overall error: the coefficient estimates minimize the sum of squared residuals, \sum_i e_i^2.
\varepsilon \overset{\text{iid}}{\sim} N(0, \sigma^2)
The assumptions are on the residual!
What this means:
Residuals are normally distributed
Distribution of residuals is centered at 0
Variance of residuals is some constant \sigma^2
We will assess the assumptions graphically.
Constant variance: scatterplot
Normal distribution: q-q plot
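These plots can also be produced directly in base R. A minimal sketch, using the built-in mtcars data and an example model (mpg ~ wt + hp) as a stand-in for our colony model; anova_check() from classpackage wraps similar plots:

```r
# Sketch: graphical assumption checks in base R, using built-in mtcars data
m <- glm(mpg ~ wt + hp, data = mtcars)

res <- resid(m)    # residuals e_i = y_i - y_hat_i
fit <- fitted(m)   # predicted values y_hat_i

# Constant variance: residuals vs. fitted values should form an even band
# around 0 with no pattern
plot(fit, res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normality: points on the q-q plot should follow the reference line
qqnorm(res)
qqline(res)
```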
A package was written by a former student, classpackage.
If you are working on the server, the package is already installed.
If you are not working on the server, you need to install the package:
We will use the anova_check() function.
library(tidyverse)
library(fastDummies)
colony <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-01-11/colony.csv') %>%
  mutate(quarter = case_when(months == "January-March" ~ "Q1",
                             months == "April-June" ~ "Q2",
                             months == "July-September" ~ "Q3",
                             months == "October-December" ~ "Q4")) %>%
  dummy_cols(select_columns = c("quarter")) %>%
  select(quarter, year, state, colony_lost, quarter_Q1, quarter_Q2, quarter_Q3, quarter_Q4, colony_max)
head(colony, n = 5)

m1 <- glm(colony_lost ~ quarter_Q1 + quarter_Q2 + quarter_Q3 + year + colony_max, data = colony)
summary(m1)
Call:
glm(formula = colony_lost ~ quarter_Q1 + quarter_Q2 + quarter_Q3 +
year + colony_max, data = colony)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.274e+05 2.203e+05 1.940 0.0526 .
quarter_Q1 5.622e+02 5.785e+02 0.972 0.3314
quarter_Q2 -3.496e+03 5.987e+02 -5.839 6.84e-09 ***
quarter_Q3 5.952e+02 5.977e+02 0.996 0.3196
year -2.119e+02 1.092e+02 -1.941 0.0525 .
colony_max 1.167e-01 1.086e-03 107.466 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 49277184)
Null deviance: 6.2772e+11 on 1149 degrees of freedom
Residual deviance: 5.6373e+10 on 1144 degrees of freedom
(72 observations deleted due to missingness)
AIC: 23641
Number of Fisher Scoring iterations: 2
If we want to know how well the model fits the particular dataset at hand, we can look at the R^2.
R^2 is the proportion of variance explained by the model.
Because it is a proportion, R^2 \in [0, 1] and is independent of the units of y.
If R^2 \to 0, the model does not fit the data well; if R^2 \to 1, the model fits the data well.
R^2 = \frac{\text{SSReg}}{\text{SSTot}}
Remember that we are partitioning the variability in y (SSTot), which is constant, into two pieces:
The variability explained by the regression model (SSReg).
The variability explained by outside sources (SSE).
As predictors are added to the model, we necessarily increase SSReg / decrease SSE.
We want a measure of model fit that is resistant to this fluctuation, R^2_{\text{adj}} = 1 - \left( \frac{n-1}{n-k-1} \right) \left( 1 - R^2 \right), where k is the number of predictor terms in the model.
We will use the adjR2() function from the glmtoolbox package to find R^2_{\text{adj}}.
In our example, we would run adjR2(m1).
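R^2 and R^2_{\text{adj}} can also be computed from first principles. A sketch on the built-in mtcars data with an example model (mpg ~ wt + hp, so k = 2 predictor terms):

```r
# Sketch: R^2 and adjusted R^2 from their definitions, using mtcars
m <- glm(mpg ~ wt + hp, data = mtcars)

y <- mtcars$mpg
n <- length(y)                 # number of observations
k <- 2                         # number of predictor terms (wt, hp)

sstot <- sum((y - mean(y))^2)  # SSTot: total variability in y
sse   <- sum(resid(m)^2)       # SSE: variability left to outside sources
r2    <- 1 - sse / sstot       # equivalently SSReg / SSTot

r2_adj <- 1 - ((n - 1) / (n - k - 1)) * (1 - r2)
```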
Definition: data values that are much larger or smaller than the rest of the values in the dataset.
We will look at the standardized residuals, e_{i_{\text{standardized}}} = \frac{e_i}{\sqrt{\text{MSE}(1-h_i)}}, where
e_i is the i-th residual,
\text{MSE} is the mean squared error of the model, and
h_i is the leverage of the i-th observation.
If |e_{i_{\text{standardized}}}| > 2.5 \ \to \ outlier.
If |e_{i_{\text{standardized}}}| > 3 \ \to \ extreme outlier.
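The standardized residuals can be computed directly from the formula above. A sketch on the built-in mtcars data with an example model (mpg ~ wt + hp), checked against R's rstandard():

```r
# Sketch: standardized residuals from the formula, using mtcars
m <- glm(mpg ~ wt + hp, data = mtcars)

e <- resid(m)
h <- hatvalues(m)               # leverage values h_i
n <- nrow(mtcars)
k <- 2                          # number of predictor terms
mse <- sum(e^2) / (n - k - 1)   # MSE = SSE / (n - k - 1)

e_std <- e / sqrt(mse * (1 - h))

# flag suspected outliers
suspected <- abs(e_std) > 2.5
```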
We will use the rstandard() function to find the standardized residuals.
For ease of examining in large datasets, we will use it to create a “flag.”
colony <- colony %>%
  mutate(outlier = if_else(abs(rstandard(m1)) > 2.5, "Suspected", "Not Suspected"))
Error in `mutate()`:
ℹ In argument: `outlier = if_else(abs(rstandard(m1)) > 2.5, "Suspected",
  "Not Suspected")`.
Caused by error:
! `outlier` must be size 1222 or 1, not 1150.
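The error occurs because glm() silently drops rows with missing values: rstandard(m1) returns one value per row the model actually used (1150), while colony still has 1222 rows. A sketch of the same problem and one fix, using the built-in airquality data (which also contains missing values); the course's colony2 is presumably built the same way:

```r
# Sketch: size mismatch from missing values, and one fix, using airquality
aq <- airquality                            # Ozone contains NAs
m  <- glm(Ozone ~ Wind + Temp, data = aq)   # glm() drops incomplete rows
length(rstandard(m)) == nrow(aq)            # FALSE: same mismatch as above

aq2 <- na.omit(aq)                          # keep only complete cases first
m   <- glm(Ozone ~ Wind + Temp, data = aq2) # refit on the complete cases
aq2$outlier <- ifelse(abs(rstandard(m)) > 2.5,
                      "Suspected", "Not Suspected")
```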
Recall from summary(m1) that 72 observations were deleted due to missingness: the model used 1150 rows, but colony has 1222. Before flagging outliers, we first remove the rows with missing values, creating colony2, and refit the model.
m2 <- glm(colony_lost ~ quarter_Q1 + quarter_Q2 + quarter_Q3 + year + colony_max, data = colony2)
summary(m2)
Call:
glm(formula = colony_lost ~ quarter_Q1 + quarter_Q2 + quarter_Q3 +
year + colony_max, data = colony2)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.274e+05 2.203e+05 1.940 0.0526 .
quarter_Q1 5.622e+02 5.785e+02 0.972 0.3314
quarter_Q2 -3.496e+03 5.987e+02 -5.839 6.84e-09 ***
quarter_Q3 5.952e+02 5.977e+02 0.996 0.3196
year -2.119e+02 1.092e+02 -1.941 0.0525 .
colony_max 1.167e-01 1.086e-03 107.466 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 49277184)
Null deviance: 6.2772e+11 on 1149 degrees of freedom
Residual deviance: 5.6373e+10 on 1144 degrees of freedom
AIC: 23641
Number of Fisher Scoring iterations: 2
A leverage point is an observation with an unusual (extreme) value among the predictor variables.
An influential point is an observation whose removal substantially changes the estimated model.
We check these together using Cook’s distance.
We will use the cooks() function from the classpackage package.
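Base R's cooks.distance() computes the same quantity that cooks() wraps. A sketch on the built-in mtcars data with an example model (mpg ~ wt + hp); the 4/n cutoff shown is one common rule of thumb, so check which cutoff classpackage uses:

```r
# Sketch: Cook's distance with base R, using mtcars
m <- glm(mpg ~ wt + hp, data = mtcars)
d <- cooks.distance(m)        # one distance per observation

# one common rule of thumb: flag observations with D_i > 4/n
n <- nrow(mtcars)
flagged <- which(d > 4 / n)
```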
Collinearity/multicollinearity: correlation between two or more predictor variables that affects the estimation procedure.
We will use the variance inflation factor (VIF) to check for multicollinearity. \text{VIF}_j = \frac{1}{1-R^2_j},
where R^2_j is the R^2 from regressing the j-th predictor on the remaining predictors.
We say that multicollinearity is present if VIF > 10.
How do we deal with multicollinearity?
Easy answer: remove at least one predictor from the collinear set, then reassess VIF.
More complicated: discussing with collaborators/bosses.
We will use the vif() function from the car package.
Note: the car package overwrites some functions from tidyverse, so I typically do not load the full library.
There will be a VIF value for each predictor in the model.
quarter_Q1 quarter_Q2 quarter_Q3 year colony_max
1.574555 1.525640 1.520629 1.013434 1.000756
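Each VIF can also be computed from its definition. A sketch on the built-in mtcars data, imagining a model with predictors wt, hp, and disp: R^2_j comes from regressing one predictor on the others, and car::vif() reports the same values for the full model:

```r
# Sketch: VIF_j = 1 / (1 - R^2_j), computed by hand on mtcars
m_j  <- lm(wt ~ hp + disp, data = mtcars)  # regress predictor j on the rest
r2_j <- summary(m_j)$r.squared             # R^2_j
vif_wt <- 1 / (1 - r2_j)                   # VIF for wt
```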
We can perform sensitivity analysis to determine how much our model changes when we exclude the outliers.
Questions we will ask: Do the coefficient estimates change noticeably? Do any conclusions about significance change?
We only perform sensitivity analysis once (i.e., we remove data points for reanalysis only once).
Recall m2, the model fit to colony2 with the suspected outliers still included (summary above). We now refit without them:
colony_no_outliers <- colony2 %>% filter(outlier == "Not Suspected")
m3 <- glm(colony_lost ~ quarter_Q1 + quarter_Q2 + quarter_Q3 + year + colony_max, data = colony_no_outliers)
summary(m3)
Call:
glm(formula = colony_lost ~ quarter_Q1 + quarter_Q2 + quarter_Q3 +
year + colony_max, data = colony_no_outliers)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.192e+05 1.117e+05 1.963 0.0499 *
quarter_Q1 -8.762e+01 2.929e+02 -0.299 0.7649
quarter_Q2 -1.810e+03 3.051e+02 -5.933 3.97e-09 ***
quarter_Q3 2.071e+02 3.026e+02 0.684 0.4938
year -1.087e+02 5.535e+01 -1.963 0.0499 *
colony_max 1.121e-01 8.059e-04 139.035 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 12331684)
Null deviance: 2.5457e+11 on 1117 degrees of freedom
Residual deviance: 1.3713e+10 on 1112 degrees of freedom
AIC: 21435
Number of Fisher Scoring iterations: 2
Today we have covered the basics of
model assumptions
model diagnostics
sensitivity analysis
Next week, we will learn about a new type of term: an interaction.
Continuous \times continuous
Continuous \times categorical
Categorical \times categorical